STalign: Alignment of spatial transcriptomics data using diffeomorphic metric mapping

Spatial transcriptomics (ST) technologies enable high throughput gene expression characterization within thin tissue sections. However, comparing spatial observations across sections, samples, and technologies remains challenging. To address this challenge, we develop STalign to align ST datasets in a manner that accounts for partially matched tissue sections and other local non-linear distortions using diffeomorphic metric mapping. We apply STalign to align ST datasets within and across technologies as well as to align ST datasets to a 3D common coordinate framework. We show that STalign achieves high gene expression and cell-type correspondence across matched spatial locations that is significantly improved over landmark-based affine alignments. Applying STalign to align ST datasets of the mouse brain to the 3D common coordinate framework from the Allen Brain Atlas, we highlight how STalign can be used to lift over brain region annotations and enable the interrogation of compositional heterogeneity across anatomical structures. STalign is available as an open-source Python toolkit at https://github.com/JEFworks-Lab/STalign and as Supplementary Software with additional documentation and tutorials available at https://jef.works/STalign.


Supplementary Tables
1 Cell density data model For single-cell resolution spatially resolved transcriptomics data, we model the point sets of detected cells in the framework of varifold measures [1].While the theory extends to more complex spaces of features, here we focus on image varifolds [2] by utilizing the locations of cells only, termed the marginal space measure (after marginalizing out features other than spatial location) as defined in [3].Briefly, these space measures are weighted sums of Dirac δ distributions ρ .= Nq i w ρ i δ x ρ i , where x ρ i ∈ R D stores the spatial coordinate of the ith out of N q cells, and w ρ i ∈ R stores its weight.In this work, cell positional data is two dimensional, so D = 2, and with some abuse of notation we sometimes write (x, y) instead of x.
We aim to evaluate the similarity between two single-cell resolution spatially resolved transcriptomics datasets, which we call a source and a target.Note other commonly used terms for source are: template, atlas, or moving image, while another commonly used terms for target is: fixed image.To compute distances between datasets, we embed their corresponding space measures ρ S and ρ T respectively in the dual of a Reproducing Kernel Hilbert Space V * and use the standard operator norm (see for example [4]).For some choice of kernel function k, the norm squared is Here we chose k as a Gaussian with k( The variables w ρ S i S , w ρ S j S , and x ρ S i S , x ρ S j S corresponds to the weights and spatial coordinates of the ith and jth cells in the source, while w ρ T i T , w ρ T j T , x ρ T i T , x ρ T j T correspond to the weights and spatial coordinates of the cells in the target.For simplicity, in the main paper we write (x ρ S , y ρ S ) and (x ρ T , y ρ T ) for source and target points respectively.
In STalign, we initialize weights to 1, though applying nonlinear deformations will modify these weights as discussed below in section 4.

Rasterization
Since computation of this norm is of quadratic complexity in the number of points, we turn to a more efficient representation for computing optimal transformations through rasterization.We can reduce the complexity of our calculations significantly by approximating our space measures through sampling a density signal on a regular grid (known as rasterization), rather than keeping a list of points and weights.
By defining k = k (where * refers to convolution ), the above expression for norm squared (1) can be written as Note that when k is a radially symmetric Gaussian, k 1 2 is also a radially symmetric Gaussian but with half the variance.Here we have defined the smooth density function and ∥ • ∥ 2 is the L 2 norm on functions.Due to the smoothness introduced by k 1 2 , these functions can be accurately discretized by sampling them on a uniform pixel grid at a resolution rate defined by the user and comparable in size to σ.
Without rasterization, evaluating the function I at a given point would involve a sum over every x i , an order N complexity operation.After rasterization the function I can be evaluated at any point in order 1 complexity using bilinear interpolation.This allows the norm to be evaluated by summing over a pixel grid in order P complexity (where P is the number of pixels), rather than a double sum over the points x i in order N 2 complexity.
For example, MERFISH Slice 2 Replicate 3 has 85958 cells, and the rasterized dataset has 336 × 256 = 86, 016 pixels.The Naive approach would involve 7,388,777,764 terms in the sum (pairs of cells), whereas in the rasterization approach there are only 86,016 terms in the sum (pixels).This is an approximately 86,000 times increase in efficiency which occurs for each iteration of optimization, ignoring the negligible time cost of rasterization itself, which occurs only once at the start of registration.
In this section we showed how a rasterized image I can be produced from a list of cell location, in a manner compatible with the theory of varifolds.However, our registration algorithm can be performed with any standard rasterized image type.For example, in the main manuscript we show examples where I T is a red-green-blue image corresponding to a brightfield microscopy image of H&E stained tissue.How such images of different contrast profiles are handled is described in section 5.

Diffeomorphic transformation model
We estimate alignments between two rasterized datasets by applying a transformation ϕ : R D → R D .ϕ(x) .= Aφ 1 (x), the composition of two transformations: a diffeomorphism (a smooth differentiable transformation with a smooth differentiable inverse) φ 1 : R D → R D generated in the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework [5], and an affine transformation A (i.e. a 3x3 matrix in homogeneous coordinates whose upper left 2x2 block is a linear transform and upper right 2x1 block is a translation vector).In this notation Aφ 1 (x) denotes matrix multiplication of the matrix A and the vector φ 1 in homogeneous coordinates.
In the LDDMM framework a diffeomormphism is generated by integrating a time varying velocity field v t over time t ∈ [0, 1], by solving the ordinary differential equation with initial condition φ 0 = id.For identifying alignments, we optimize over v t rather than φ 1 directly, and to emphasize this dependence we use the superscript φ v in the main text.Similarly, we use ϕ A,v to emphasize the dependence of ϕ on both A and v t .As long as v t a smooth function of space, φ v 1 is guaranteed to be diffeomorphic.We enforce this through regularization as described below in section 5.
While this section described how we parameterize our transformations, next we need to describe how they act to deform our datasets, in order to use them in an optimization problem.

Action of transformations on datasets
The action of a transformation ϕ on a space measure dataset ρ moves the spatial coordinate of each cell, and adjusts the weight of each cell based on the transformation's Jacobian determinant.
where dϕ(x) denotes the matrix of partial derivatives of the map ϕ at the point x, and | • | represents the determinant of a matrix.
We note that the standard image action [ϕ • I](x) = I(ϕ −1 (x)) has been well studied theoretically (as a left group action), computationally (in terms of its efficient implementation through interpolation), and application-wise (in terms of its use in a variety of image registration platforms e.g.[5]).This image action for continuous image functions is not appropriate for space datasets and therefore the image action does not match the measure action defined in (5).However, in the dense tissue limit, the continuous image action is consistent with the measure action of (5) as proven in [3].Since the applications shown here provide a dense approximation, this aforementioned consistency motivates us to leverage the continuous image action for its computational advantages.We write the image action as follows: The approximate equality is accurate in our examples when k 1 2 is narrow relative to the smoothness of v t and wide relative to the spacing between cells.The approximate equality would be exact if k 1 2 were a Dirac Supplementary Notes Figure 1: Rasterization followed by deformation with the image action (left), versus naively deforming the positions of cells followed by rasterization (right).Note the intensity changes that occur in regions of high deformation.delta function.If cells are too far apart, a larger value of σ could be chosen.For a particularly sparse set of cells, a different method that does not include rasterization would be more appropriate, for example measure matching [6].
With this formulation, deformations can be applied to smooth density images I using interpolation in order P (number of pixels) complexity.Supplementary Notes Figure 1 illustrates the importance of the Jacobian factor.Without including this factor, transforming a density image alters its brightness, which is typically undesirable: bigger organ tends to have more cells with the same cell density, rather than the same number of cells with a lower density.

Image registration
We compute a spatial alignment between two ST datasets by minimizing the sum of two objective functions: a regularization term R, and a matching term M θ , which we define below.Note that the computation of I is described in section 2, the parameterization of ϕ A,v is described in section 3, and the action ϕ A,v • I S is described in the section 4. Recall that ϕ depends on both the velocity field v and the affine transform A.
Following the LDDMM framework, we regularize our diffeomorphism via where id is an identity matrix, σ 2 R is a user tunable parameter that adjusts balance between matching accuracy and regularization, where large values correspond to less regularization and higher matching accuracy, and small values correspond to more regularization and lower matching accuracy.∆ is the Laplacian, a is a constant with units of length that controls spatial smoothness, and p = 2 is a power that must be large enough to guarantee that results are diffeomorphisms [7].Note that small values of a may be overfitting of noise whereas large values of a may lead to low accuracy.In practice, we chose a value of a based on the spatial smoothness of deformations that we believe to be realistic.We then consider several values of σ 2 R (starting with a value provided by one of our online examples), and chose the one that achieves a reasonable balance between regularization and accuracy.
Our matching takes the form of [8] where σ 2 M is a user tunable parameter that describes the amount of noise in our imaging data (see description of Gaussian Mixture modeling below).
Note that I T need not correspond to a smooth density image as defined in section 2. For example, we include the case where it is a red green blue image corresponding to an H&E stain.
The function f θ is a transformation of image contrast with unknown parameters θ.We use a polynomial for f θ , in which case the minimizing parameters θ can be found exactly by solving a weighted least squares problem.The purpose of this transformation is to model differences in contrast between images from the same modality due to calibration issues; and contrast/color differences between different modalities.
In this work we found that first order polynomials were sufficient for accurate image registration.In other work in neuroimaging we have used 3rd order polynomials, which have enough degrees of freedom to map the intensity of gray matter, white matter, and background to arbitrary intensities [8].Because there are many more pixels than degrees of freedom, it is unlikely that these polynomials will overfit the observed data I T .However, depending on the initialization of transformation parameters this is possible: if tissue in I and I T do not overlap at all, parameters θ may be estimated to zero out imaging information and transform I into a constant function that looks like background only.
The term W is a positive weight that represents the probability that a given pixel in the target image can be matched accurately to one in the source image.For example, if tissue is missing in the target image but not the source image, pixels in the region of missing tissue would get a small weight.Similarly, if the target image included a signal not present in the source (e.g. a bright fluorescence signal).
To optimize E, we alternate between updating W M with Gaussian mixture modeling, and jointly updating (θ, ϕ A,v ) with gradient based methods, using expectation maximization algorithm as discussed in [8].Briefly, we use 3 classes in our Gaussian mixture model (pixels to be matched, background, and artifact).Each is modeled as a Gaussian random variable with an unknown mean (optionally the mean can be assumed known and specified as an input parameter), a known variance (specified as an input parameter), and an unknown prior probability.While the means for background and artifact are constant values, the mean for pixels to be matched is equal to f θ ([ϕ A,v • I S ](x)) and is a function of space.Parameters are estimated by standard Gaussian mixture modeling techniques, and W M (x) is computed as the posterior probability that the pixel at x belongs to the "pixels to be matched" class.If g(x, µ, σ 2 ) is a multivariate normal with mean µ and covariance σ 2 times identity, and π i are prior probabilities for each class (i ∈(matching, background, and artifact)), then In the above expression, note that the denominator shows a mixture of three Gaussians, and the numerator shows the first class in the mixture.In our code we also define W B and W A , which are posterior probabilities that a pixel belongs to the background or artifact classes.They are defined with same denominator as W M , but with numerators corresponding to the mixture component for their class.Recall that updating ϕ corresponds to updating the affine transformation matrix A, and the velocity field v t which generates the deformation φ 1 from (4).After solving for the optimal transformation parameters A and v t , a transformation and its inverse are constructed by solving (4) sampled on a regular grid, using Semi-Lagrangian techniques [9].With ϕ i S in the source image can be mapped into the target by calculating ϕ(x ρ S i S ) through linear interpolation.Similarly, a point x ρ T i T in the target image can be mapped to the atlas by calculating ϕ −1 (x ρ T i T ) through linear interpolation.

Landmark Optimization Option
For improved robustness, our software allows users to input pairs of corresponding points in the source and target images.These points can be used either to initialize the affine transformation A through least squares (steering our gradient based solution toward an appropriate local minima in this challenging nonconvex optimization problem); or can be used to drive the optimization problem itself by modifying to our objective function to be E(A, v) = R(v) + M θ (ϕ A,v • I S , I T ) + P (ϕ A,v (X S ), X T ) such that where X S i and X T i are the ith point of N corresponding points in the source and target respectively and σ 2 P is a user tunable parameter that adjusts balance between matching corresponding landmark points, matching images, and regularization, where large values correspond to less accuracy matching points and small values correspond to more accuracy matching points.Landmark based optimization in the LDDMM framework has been studied extensively (see for example [10]).

3D to 2D alignment
In addition to aligning spatially resolved transcriptomics datasets in which the cell positional information is 2D, we registered the 3D reconstructed Allen common coordinate framework (CCF) atlas (source) to each of the 9 MERFISH datasets (target).The image transformation is similar to the alignment discussed in the section 5 with a few exceptions: It is important to note that all transformations are performed on the source 3D atlas.Since the 50 µm Nissl-stained Allen Brain Atlas CCF v3 was used as the source image, rasterization is not applied to the atlas.The affine transformation A for the 3D-2D alignment is a 4×4 3D matrix in homogeneous coordinates.The space of dense 3D images in the orbit of the atlas, are defined via diffeomorphisms The diffeomorphism ϕ ∈ D acts on the atlas to generate the orbit of imagery I, The velocity field v t is still defined by ( 4), but v t ∈ R 3 .The image I S in the matching term M represents the transformed source atlas evaluated at z = 0, to enable comparison in the same dimension between the source and the target images.

Table 1 .
Description of manually placed landmark locations on ST datasets of the mouse brain.
Lateral-most part of ventral posterolateral nucleus of the thalamus (had to extrapolate in some areas to keep smooth curve) (left/right) ec_L/ec_R Follow the corpus callosum body (ccb) to the lateral-most part where it becomes the external capsule (should be slightly ventral to VPL points) (left/right) dg_L/dg_R Dentate gyrus; follow hippocampal formation from dorsal to ventral until distinct color change.Point should be at the last part of the initial color (centered at the beginning of what looks like an alligator's mouth) (left/

Table 2 .
Runtime estimates for the STalign.LDDMM and the STalign.LDDMM_3D_to_slice functions for different ST data alignments.

. Evaluation of STalign against supervised affine alignment. a.
Spatial agreement of target and source that has been aligned based on a simple affine transformation based on manually placed landmarks.b.Correspondence of gene expression spatial organization between the target and supervised affine aligned source for select spatially patterned genes.c.

Supplementary Figure 9. Cell types in MERFISH dataset. a. Heat map of differentially expressed
genes in cell types defined by Leiden clustering for all cells across 9 MERFISH datasets.
b. Cell types defined by differential gene expression and Leiden clustering, represented by different colors, for three biological replicates of three brain slices.
2, where | • | denotes the standard Euclidean norm, and σ is a user specified kernel width parameter.